NSF PAR Search | NSF Public Access Repository

How Well Do Large Language Models Understand Tables in Materials Science?

https://doi.org/10.1007/s40192-024-00362-6

Circi, Defne; Khalighinejad, Ghazal; Chen, Anlan; Dhingra, Bhuwan; Brinson, L Catherine (July 2024, Integrating Materials and Manufacturing Innovation)

Advances in materials science require leveraging past findings and data from the vast published literature. While some materials data repositories are being built, they typically rely on newly created data in narrow domains because extracting detailed data and metadata from the enormous wealth of publications is immensely challenging. The advent of large language models (LLMs) presents a new opportunity to rapidly and accurately extract data and insights from the published literature and transform it into structured data formats for easy query and reuse. In this paper, we build on initial strategies for using LLMs for rapid and autonomous data extraction from materials science articles in a format curatable by materials databases. We presented the subdomain of polymer composites as our example use case and demonstrated the success and challenges of LLMs on extracting tabular data. We explored different table representations for use with LLMs, finding that a multimodal model with an image input yielded the most promising results. This model achieved an accuracy score of 0.910 for composition information extraction and an F1 score of 0.863 for property name information extraction. With the most conservative evaluation for the property extraction requiring exact match in all the details, we obtained an F1 score of 0.419. We observed that by allowing varying degrees of flexibility in the evaluation, the score can increase to 0.769. We envision that the results and analysis from this study will promote further research directions in developing information extraction strategies from materials information sources.
BirdieDNA: Reward-Based Pre-Training for Genomic Sequence Modeling

Blouir, Samuel; Circi, Defne; Moldwin, Asher; Shehu, Amarda (April 2024, ICLR MLGenX Workshop)

Transformer-based language models have shown promise in genomics but face challenges unique to DNA, such as sequence lengths spanning hundreds of millions of base pairs and subtle long-range dependencies. Although next-token prediction remains the predominant pre-training objective (inherited from NLP), recent research suggests that multi-objective frameworks can better capture complex structure. In this work, we explore whether the Birdie framework, a reinforcement learning-based, mixture-of-objectives pre-training strategy, can similarly benefit genomic foundation models. We compare a slightly modified Birdie approach with a purely autoregressive, next-token prediction baseline on standard Nucleotide Transformer benchmarks. Our results show performance gains in the DNA domain, indicating that mixture-of-objectives training could be a promising alternative to pre-training with next-token prediction alone for genomic sequence modeling.
32 examples of LLM applications in materials science and chemistry: towards automation, assistants, agents, and accelerated scientific discovery

https://doi.org/10.1088/2632-2153/ae011a

Zimmermann, Yoel; Bazgir, Adib; Al-Feghali, Alexander; Ansari, Mehrad; Bocarsly, Joshua; Brinson, L Catherine; Chiang, Yuan; Circi, Defne; Chiu, Min-Hsueh; Daelman, Nathan; et al (August 2025, Machine Learning: Science and Technology)

Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models are able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 total projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. In particular, improvements in both open source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.
34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

Zimmermann, Yoel; Bazgir, Adib; Al-Feghali, Alexander; Ansari, Mehrad; Bocarsly, Joshua; Brinson, L Catherine; Chiang, Yuan; Circi, Defne; Chiu, Min-Hsueh; Daelman, Nathan; et al (May 2025, ArXiv)

Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models are able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 total projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. In particular, improvements in both open source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.
14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon

https://doi.org/10.1039/d3dd00113j

Jablonka, Kevin Maik; Ai, Qianxiang; Al-Feghali, Alexander; Badhwar, Shruti; Bocarsly, Joshua D.; Bran, Andres M.; Bringuier, Stefan; Brinson, L. Catherine; Choudhary, Kamal; Circi, Defne; et al (August 2023, Digital Discovery)

Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon. This article chronicles the projects built as part of this hackathon. Participants employed LLMs for various applications, including predicting properties of molecules and materials, designing novel interfaces for tools, extracting knowledge from unstructured data, and developing new educational applications. The diverse topics and the fact that working prototypes could be generated in less than two days highlight that LLMs will profoundly impact the future of our fields. The rich collection of ideas and projects also indicates that the applications of LLMs are not limited to materials science and chemistry but offer potential benefits to a wide range of scientific disciplines.